Abstract



We present a suite of algorithms for Dimension Independent Similarity Computation (DISCO)
to compute all pairwise similarities between very high-dimensional sparse vectors. All of
our results are provably independent of dimension, meaning that apart from the initial cost
of trivially reading in the data, all subsequent operations are independent of the dimension;
thus the dimension can be very large. We study Cosine, Dice, Overlap, and the Jaccard
similarity measures. For Jaccard similarity we include an improved version of MinHash.
Our results are geared toward the MapReduce framework. We empirically validate our
theorems with large scale experiments using data from the social networking site Twitter.
At time of writing, our algorithms are live in production at twitter.com.
1. Introduction



Computing similarity between all pairs of vectors in large-scale datasets is a challenge. 
Traditional approaches of sampling the dataset are limited and linearly dependent on the 
dimension of the data. We present an approach whose complexity is independent of the data 
dimension and geared towards modern distributed systems, in particular the MapReduce 
framework (Dean and Ghemawat, 2008).



MapReduce is a programming model for processing large data sets, typically used to 
do distributed computing on clusters of commodity computers. With a large amount of
processing power at hand, it is very tempting to solve problems by brute force. However, 
we show how to combine clever sampling techniques with the power of MapReduce to extend 
its utility. 

Consider the problem of finding all pairs of similarities between D indicator (0/1 entries)
vectors, each of dimension N. In particular we focus on cosine similarities between all
pairs of D vectors in $\mathbb{R}^N$. Further assume that each dimension is L-sparse, meaning each
dimension has at most L non-zeros across all points. For example, typical values to compute
similarities between all pairs of a subset of Twitter users can be: $N = 10^9$ (the universe of
Twitter users), $D = 10^7$ (a subset of Twitter users), L = 1000. To be more specific: each
of D users can be represented by an N dimensional vector that indicates which of the N
universal users are followed. The sparsity parameter (L = 1000) here assumes that each
user follows at most 1000 other users.

There are two main complexity measures for MapReduce: "shuffle size" and "reduce-key
complexity", defined shortly (Goel and Munagala, 2012). It can be easily shown that
the naive approach for computing similarities will have $O(NL^2)$ emissions, which for the
example parameters we gave is infeasible. The number of emissions in the map phase is
called the "shuffle size", since that data needs to be shuffled around the network to reach
the correct reducer. Furthermore, the maximum number of items reduced to a single key
can be as large as N. Thus the "reduce-key complexity" for the naive scheme is N.

We can drastically reduce the shuffle size and reduce-key complexity by some clever 
sampling with the DISCO scheme described in this paper. In this case, the output of 
the reducers are random variables whose expectations are the similarities. Two proofs are 
needed to justify the effectiveness of this scheme: first, that the expectations are indeed 
correct and obtained with high probability, and second, that the shuffle size is greatly 
reduced. We prove both of these claims. In particular, in addition to correctness, we 
prove that to estimate similarities above $\epsilon$, the shuffle size of the DISCO scheme is only
$O(DL\log(D)/\epsilon)$, with no dependence on the dimension N, hence the name.

This means as long as there are enough mappers to read the data, one can use the 
DISCO sampling scheme to make the shuffle size tractable. Furthermore, each reduce key 
gets at most $O(\log(D)/\epsilon)$ values, thus making the reduce-key complexity tractable too.
Within Twitter, Inc., we use the DISCO sampling scheme to compute similar users. We have
also used the scheme to find highly similar pairs of words by taking each dimension to be 
the indicator vector that signals in which tweets the word appears. We empirically verify 
the proven claims and observe large reductions in shuffle size in this paper. 

Our sampling scheme can be used to implement many other similarity measures. We 
focus on four similarity measures: Cosine, Dice, Overlap, and the Jaccard similarity
measures. For Jaccard similarity we present an improved version of the well known MinHash
scheme (Broder, 1997). Our framework operates under the following assumptions, each of
which we justify: 

First, we focus on the case where each dimension is sparse and therefore the natural 
way to store the data is by segmenting it into dimensions. In our application each
dimension is represented by a tweet, thus this assumption was natural. Second, our sampling
scheme requires a "background model" , meaning the magnitude of each vector is assumed 
to be known and loaded into memory. In our application this was not a hurdle, since the 
magnitudes of vectors in our corpus needed to be computed for other tasks. In general this 
may require another pass over the dataset. To further address the issue, in the streaming 
computation model, we can remove the dependence by paying an extra logarithmic factor 
in memory used. Third, we prove results on highly similar pairs, since common applications 
require thresholding the similarity score with a high threshold value. 

A ubiquitous problem is finding all pairs of objects that are in some sense similar and 
in particular more similar than a threshold. For such applications of similarity, DISCO is 
particularly helpful since higher similarity pairs are estimated with provably better accuracy. 
There are many examples, including 

Advertiser keyword suggestions: When targeting advertisements via keywords, it is
useful to expand the manually input set of keywords by other similar keywords,
requiring finding all keywords more similar than a high threshold (Regelson and Fain,
2006). The vector representing a keyword will often be an indicator vector indicating
in which dimensions the keywords appear.

Collaborative filtering: Collaborative filtering applications require knowing which 
users have similar interests. For this reason, given a person or user, it is required 
to find all other objects more similar than a particular threshold, for which our
algorithms are well suited.

Web Search: Rewriting queries given to a search engine is a common trick used to
expand the coverage of the search results (Abhishek and Hosanagar, 2007). Adding
clusters of similar queries to a given query is a natural query expansion strategy.

These applications have been around for many years, but the scale at which we need 
to solve them keeps steadily increasing. This is particularly true with the rise of the web, 
and social networks such as Twitter, Facebook, and Google Plus among others. Google 
currently holds an index of more than 1 trillion webpages, on which duplicate detection must 
be a daunting task. For our experiments we use the Twitter dataset, where the number 
of tweets in a single day surpasses 200 million. Many of the applications, including ours,
involve domains where the dimensionality of the data far exceeds the number of points.
In our case the dimensionality is large (more than 200 million), and we can prove that
apart from initially reading the data (which cannot be avoided), subsequent MapReduce
operations will have input size independent of the number of dimensions. Therefore, as long 
as the data can be read in, we can ignore the number of dimensions. 

A common technique to deal with large datasets is to simply sample. Indeed, our
approach, too, is a sampling scheme; however, we sample in a nontrivial way so that points
that have nonzero entries in many dimensions will be sampled with lower probability than 
points that are only present in a few dimensions. Using this idea, we can remove the 
dependence on dimensionality while being able to mathematically prove - and empirically 
verify - accuracy. 



Although we use the MapReduce (Dean and Ghemawat, 2008) framework and discuss
shuffle size, the sampling strategy we use can be generalized to other frameworks. We focus
on MapReduce because it is the tool of choice for large scale computations at Twitter,
Google, and many other companies dealing with large scale data. However, the DISCO
sampling strategy is potentially useful whenever one is computing a number between 0
and 1 by taking the ratio of an unknown number (the dot product in our case) by some
known number (e.g. the magnitude). This is a high-level description, and we give four
concrete examples, along with proofs and experiments. Furthermore, DISCO improves
the implementation of the well-known MinHash (Broder, 1997) scheme in any MapReduce
setup.

2. Formal Preliminaries 

Let $T = \{t_1, \ldots, t_N\}$ represent N documents, each of a length of no more than L words.
In our context, the documents are tweets (documents are also the same as dimensions for
us) from the social networking site Twitter, or could be fixed-window contexts from a large
corpus of documents, or any other dataset where dimensions have no more than L nonzero
entries and the dataset is available dimension-by-dimension. We are interested in similarity
scores between pairs of words in a dictionary containing D words $\{w_1, \ldots, w_D\}$. The number
of documents in which two words $w_i$ and $w_j$ co-occur is denoted $\#(w_i, w_j)$. The number of
documents in which a single word $w_i$ occurs is denoted $\#(w_i)$.

To each word in the dictionary, an N-dimensional indicator vector is assigned, indicating
in which documents the word occurs. We operate within a MapReduce framework where 
each document is an input record. We denote the number of items output by the map phase 
as the 'shuffle size'. Our algorithms will have shuffle sizes that provably do not depend on 
N, making them particularly attractive for very high dimensional tasks. 

We focus on four different similarity measures, including cosine similarity, which is very
popular and produces high quality results across different domains (Chien and Immorlica,
2005; Chuang and Chien, 2005; Sahami and Heilman, 2006; Spertus et al., 2005). Cosine
similarity is simply the vector normalized dot product: $\frac{\#(x,y)}{\sqrt{\#(x)}\sqrt{\#(y)}}$, where $\#(x) = \sum_{i=1}^{N} x[i]$
and $\#(x,y) = \sum_{i=1}^{N} x[i]y[i]$. In addition to cosine similarity, we consider many variations
of similarity scores that use the dot product. They are outlined in Table 1.

To compare the performance of algorithms in a MapReduce framework, we report and 
analyze shuffle size, which is more reliable than wall time or any other
implementation-specific measure. We define shuffle size as the total output of the Map Phase, which is what
will need to be "shuffled" before the Reduce phase can begin. Our results and theorems
hold across any MapReduce implementation such as Hadoop (Borthakur, 2007; Gates et al.,
2009) or Google's MapReduce (Dean and Ghemawat, 2008). We focus on shuffle size
because after one trivially reads in the data via mappers, the shuffle phase is the bottleneck
since our mappers and reducers are all linear in their input size. In general, MapReduce 
algorithms are usually judged by two performance measures: largest reduce bucket and 
shuffle size. Since the largest reduce bucket is not at all a problem for us, we focus on 
shuffle size. 

For many applications, and our applications in particular, input vectors are sparse, 
meaning that the large majority of entries are 0. A sparse vector representation for a vector 
x is the set of all pairs (i, x[i]) such that x[i] > 0 over all i = 1, ..., N. The size of a vector
x, which we denote as $\#(x)$, is the number of such pairs. We focus on the case where
each dimension is sparse and therefore the natural way to store the data is segmented into 
dimensions. 
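
As a rough illustration of the two layouts just described, the following sketch stores toy data dimension-by-dimension and derives each word's sparse indicator vector from it. The toy documents and the helper name `to_word_vectors` are assumptions made for illustration and are not part of the paper.

```python
from collections import defaultdict

# Toy data stored dimension-by-dimension: each document (dimension) lists
# the words (points) that have a nonzero entry in it. Hypothetical values.
documents = {
    1: ["apple", "banana"],
    2: ["apple", "cherry"],
    3: ["apple", "banana", "cherry"],
}

def to_word_vectors(docs):
    """Return, for each word, the set of dimensions where its entry is 1.

    For indicator vectors every stored pair (i, x[i]) has x[i] = 1, so the
    set of indices i is a complete sparse representation.
    """
    vectors = defaultdict(set)
    for doc_id, words in docs.items():
        for w in set(words):
            vectors[w].add(doc_id)
    return dict(vectors)

vectors = to_word_vectors(documents)
# The size #(x) of each vector is the number of stored pairs.
size = {w: len(dims) for w, dims in vectors.items()}
```

In the setting of the paper the per-word view would not be materialized for large N; the sketch only shows how the two views relate.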

Throughout the paper we formally prove results for pairs more similar than a threshold, 
called $\epsilon$. Our algorithms will often have a tradeoff between accuracy and shuffle size, where
the tradeoff parameter is p. By design, the larger p becomes, the more accurate the estimates 
and the larger the shuffle size becomes. 

We assume that the dictionary can fit into memory but that the number of dimensions 
(documents) is so large that many machines will be needed to even hold the documents on 
disk. Note that documents and dimensions are the same thing, and we will use these terms 
interchangeably. We also assume that the magnitudes of the dimensions are known and 
available in all mappers and reducers. 






Dimension Independent Similarity Computation 



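Similarity   Definition                                 Shuffle size                         Reduce-key size
Cosine       $\#(x,y) / (\sqrt{\#(x)}\sqrt{\#(y)})$     $O(DL\log(D)/\epsilon)$              $O(\log(D)/\epsilon)$
Jaccard      $\#(x,y) / (\#(x)+\#(y)-\#(x,y))$          $O((D/\epsilon)\log(D/\epsilon))$    $O(\log(D/\epsilon)/\epsilon)$
Overlap      $\#(x,y) / \min(\#(x),\#(y))$              $O(DL\log(D)/\epsilon)$              $O(\log(D)/\epsilon)$
Dice         $2\#(x,y) / (\#(x)+\#(y))$                 $O(DL\log(D)/\epsilon)$              $O(\log(D)/\epsilon)$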

Table 1: Similarity measures and the bounds we obtain. All sizes are independent of N, 
the dimension. These are bounds for shuffle size without combining. Combining 
can only bring down these sizes. 
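
To make the four definitions concrete, here is a small sketch that evaluates each score from the three counts $\#(x)$, $\#(y)$ and $\#(x,y)$. The function names and the example counts are illustrative assumptions; the Dice formula used is the standard $2\#(x,y)/(\#(x)+\#(y))$.

```python
import math

def cosine(n_x, n_y, n_xy):
    """#(x,y) / (sqrt(#(x)) * sqrt(#(y)))"""
    return n_xy / (math.sqrt(n_x) * math.sqrt(n_y))

def jaccard(n_x, n_y, n_xy):
    """#(x,y) / (#(x) + #(y) - #(x,y))"""
    return n_xy / (n_x + n_y - n_xy)

def overlap(n_x, n_y, n_xy):
    """#(x,y) / min(#(x), #(y))"""
    return n_xy / min(n_x, n_y)

def dice(n_x, n_y, n_xy):
    """2 * #(x,y) / (#(x) + #(y))"""
    return 2 * n_xy / (n_x + n_y)

# Hypothetical example: x occurs in 4 documents, y in 9, and they co-occur in 3.
scores = {f.__name__: f(4, 9, 3) for f in (cosine, jaccard, overlap, dice)}
```

All four scores share the same numerator, which is why the cost of computing them is dominated by the co-occurrence counts.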



We are interested in several similarity measures outlined in Table 1. For each, we prove
the shuffle size for the bottleneck step of the pipeline needed to compute all nonzero scores.
Our goal is to compute similarity scores between all pairs of words, and prove accuracy
results for pairs which have similarity above $\epsilon$. The naive approach is to first compute
the dot product between all pairs of words in a MapReduce (Dean and Ghemawat, 2008)
style framework. Mappers act on each document, and emit key-value pairs of the form
$(w_i, w_j) \to 1$. These pairs are collected with the sum reducer, which gives the dot product
$\#(w_i, w_j)$. The product is then used to trivially compute the formulas in Table 1. The
difficulty of computing the similarity scores lies almost entirely in computing $\#(w_i, w_j)$.

Shuffle size is defined as the number of key-value pairs emitted. The shuffle size for the
above naive approach is $N\binom{L}{2} = O(NL^2)$, which can be prohibitive for N too large. Our
algorithms remove the dependence on N by dampening the number of times popular words
are emitted as part of a pair (e.g. see Algorithm 5). Instead of emitting every pair, only
some pairs are output, and instead of computing intermediary dot products, we directly 
compute the similarity score. 



3. Complexity Measures 



An in-depth discussion of complexity measures for MapReduce is given in Goel and Munagala
(2012). We focus on shuffle size and reduce-key complexity for two reasons. First, focusing
on reduce-key complexity captures the performance of an idealized MapReduce system
with infinitely many mappers and reducers and no coordination overheads. A small
reduce-key complexity guarantees that a MapReduce algorithm will not suffer from "the curse of
the last reducer" (Suri and Vassilvitskii, 2011), a phenomenon in which the average work
done by all reducers may be small, but due to variation in the size of reduce records, the
total wall clock time may be extremely large or, even worse, some reducers may run out of
memory.

Second, we focus on shuffle size since the total size of input/output by all mappers/reducers 
captures the total file I/O done by the algorithm, and is often of the order of the shuffle 
size. 

Lastly, our complexity measures depend only on the algorithm, not on details of the
MapReduce installation such as the number of machines, the number of Mappers/Reducers etc.,
which is a desirable property for the analysis of algorithms (Goel and Munagala, 2012).
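
The two measures can be read directly off the map output. The sketch below is an illustration under the assumption that the map output is available as an in-memory list of (key, value) pairs; the helper names and sample emissions are hypothetical.

```python
from collections import Counter

def shuffle_size(emissions):
    """Total number of key-value pairs emitted by the map phase."""
    return len(emissions)

def reduce_key_complexity(emissions):
    """Largest number of values that end up under any single key."""
    counts = Counter(key for key, _ in emissions)
    return max(counts.values()) if counts else 0

# Hypothetical map output: three values for one key, one for another.
emissions = [
    (("a", "b"), 1),
    (("a", "b"), 1),
    (("a", "c"), 1),
    (("a", "b"), 1),
]
```

Neither quantity refers to the number of machines or tasks, which reflects the point that the measures depend only on the algorithm.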



4. Related Work 



Previously the all-pairs similarity search problem has been studied in (Broder et al., 1997;
Broder, 1997), in the context of identifying near-duplicate web pages (Broder et al., 1997).
We describe MinHash later in Section 5.2. We improve the vanilla implementation of
MinHash on MapReduce to arrive at better shuffle sizes without (effectively) any loss in
accuracy. We prove these results in Section 5.2 and verify them experimentally.



There are many papers discussing the all-pairs similarity computation problem. In
(Elsayed et al., 2008; Campagna and Pagh, 2012), the MapReduce framework is targeted,
but there is still a dependence on the dimensionality. In (Lin, 2009), all pairs are computed
on MapReduce, but there is a focus on the life sciences domain. The all-pairs similarity
search problem has also been addressed in the database community, where it is known as
the similarity join problem (Arasu et al., 2006; Chaudhuri et al., 2006; Sarawagi and Kirpal,
2004). These papers are loosely assigned to one of two categories: First, signature-based
solutions use signatures of the points to convert the nonexact matching problem to an exact
matching problem, followed by a hash map and filtering of false positives. Second, inverted
list based solutions exploit information retrieval techniques.

In Baraglia et al. (2010), the authors consider the problem of finding highly similar pairs
of documents in MapReduce. They discuss shuffle size briefly but do not provide proven 
guarantees. They also discuss reduce key size, but again do not provide bounds on the size. 

There is a large body of work on the nearest neighbors problem, which is the problem of
finding the k nearest neighbors of a given query point (Charikar, 2002; Fagin et al., 2003;
Gionis et al., 1999; Indyk and Motwani, 1998). Our problem, the all-pairs similarities
computation, is related to, but not identical to, the nearest neighbor problem. As one of the
reviewers astutely pointed out, when only the k nearest neighbors need to be found,
optimizations can be made for that. In particular, there are two problems with pairwise
similarity computation: large computation time with respect to dimensions, which we
address, and large computation time with respect to the number of points.



In (Pantel et al., 2009), the authors propose a highly scalable term similarity algorithm,
implemented in the MapReduce framework, and deployed over a 200 billion word crawl of
the Web to compute pairwise similarities between terms. Their results are still dependent 
upon the dimension and they only provide experimental evaluations of their algorithms. 



Other related work includes clustering of web data (Beeferman and Berger, 2000; Chien and
Immorlica, 2005; Sahami and Heilman, 2006; Spertus et al., 2005). These applications of clustering
typically employ relatively straightforward exact algorithms or approximation methods for
computing similarity. Our work could be leveraged by these applications for improved
performance or higher accuracy through reliance on proven results. We have also used DISCO
in finding similar users in a production environment (Gupta et al., 2013).

5. Algorithms and Methods

The Naive algorithm for computing similarities first computes dot products, then simply
divides by whatever is necessary to obtain a similarity score, i.e. in a MapReduce
implementation:

1. Given document t, Map using NaiveMapper (Algorithm 1)

2. Reduce using the NaiveReducer (Algorithm 2)



Algorithm 1 NaiveMapper(t)
  for all pairs $(w_1, w_2)$ in t do
    emit $((w_1, w_2) \to 1)$
  end for






Algorithm 2 NaiveReducer($(w_1, w_2)$, $\langle r_1, \ldots, r_R \rangle$)
  $a = \sum_{i=1}^{R} r_i$
  output $\frac{a}{\sqrt{\#(w_1)\#(w_2)}}$
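
The naive pipeline can be sketched in a few lines as a single-process simulation. This is an illustrative assumption rather than the authors' implementation: the shuffle is replaced by an in-memory dictionary, the helper `run_naive` is hypothetical, and the reducer here returns the raw dot product, leaving normalization as a separate step.

```python
from collections import defaultdict
from itertools import combinations

def naive_mapper(doc_words):
    """Emit ((w1, w2), 1) for every pair of distinct words in one document."""
    for w1, w2 in combinations(sorted(set(doc_words)), 2):
        yield (w1, w2), 1

def naive_reducer(values):
    """Sum the values for one key, giving the dot product #(w1, w2)."""
    return sum(values)

def run_naive(documents):
    """Simulate map, shuffle and reduce in memory.

    Returns the dot products and the shuffle size (number of emissions).
    """
    groups = defaultdict(list)
    shuffle = 0
    for words in documents:
        for key, value in naive_mapper(words):
            groups[key].append(value)
            shuffle += 1
    dots = {key: naive_reducer(vals) for key, vals in groups.items()}
    return dots, shuffle

# Hypothetical toy corpus of three documents.
documents = [["a", "b"], ["a", "b", "c"], ["a", "c"]]
dots, shuffle = run_naive(documents)
```

Every pair in every document is emitted, so the shuffle size grows with the number of documents regardless of how similar the words turn out to be.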





The above steps will compute all dot products, which will then be scaled by appropriate 
factors for each of the similarity scores. Instead of using the naive algorithm, we replace
its mapper with Algorithm 3 and its reducer with Algorithm 4
to directly compute the actual similarity score, not dot products. The following
sections detail how to obtain dimensionality independence for each of the similarity scores. 

Algorithm 3 DISCOMapper(t)
  for all pairs $(w_1, w_2)$ in t do
    emit using custom Emit function
  end for



Algorithm 4 DISCOReducer($(w_1, w_2)$, $\langle r_1, \ldots, r_R \rangle$)
  $a = \sum_{i=1}^{R} r_i$
  output $a \cdot \frac{\epsilon}{p}$



5.1 Cosine Similarity 

To remove the dependence on N, we replace the emit function with Algorithm 5.

Algorithm 5 CosineSampleEmit($w_1$, $w_2$) - $p/\epsilon$ is the oversampling parameter
  With probability $\frac{p}{\epsilon} \cdot \frac{1}{\sqrt{\#(w_1)}\sqrt{\#(w_2)}}$
    emit $((w_1, w_2) \to 1)$

Note that the more popular a word is, the less likely it is to be output. This is the key
observation leading to shuffle size independent of the dimension. We use a slightly different
reducer, which instead of calculating $\#(w_1, w_2)$, computes $\cos(w_1, w_2)$ directly without any
intermediate steps. The exact estimator is given in the proof of Theorem 1.
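
A minimal single-process sketch of the sampled emit and the rescaling reducer is given here, assuming the counts $\#(w)$ are known in advance and that the emit probability stays at or below 1. The function name `disco_cosine`, the toy corpus and the value chosen for $p/\epsilon$ are illustrative assumptions.

```python
import math
import random
from collections import defaultdict
from itertools import combinations

def disco_cosine(documents, counts, p_over_eps, rng):
    """Estimate cosine similarities from sampled pair emissions.

    counts[w] is #(w), assumed known beforehand.
    Returns (estimates, shuffle_size).
    """
    sums = defaultdict(int)
    shuffle = 0
    for words in documents:
        for w1, w2 in combinations(sorted(set(words)), 2):
            prob = p_over_eps / (math.sqrt(counts[w1]) * math.sqrt(counts[w2]))
            if rng.random() < prob:          # emit ((w1, w2) -> 1)
                sums[(w1, w2)] += 1
                shuffle += 1
    # Reducer: scale each sum by eps / p. Only valid while prob <= 1.
    estimates = {pair: s / p_over_eps for pair, s in sums.items()}
    return estimates, shuffle

# Toy corpus: "x" occurs in 4000 documents and "y" in 1000 of them,
# so cos(x, y) = 1000 / (sqrt(4000) * sqrt(1000)) = 0.5.
documents = [["x", "y"]] * 1000 + [["x"]] * 3000
counts = {"x": 4000, "y": 1000}
estimates, shuffle = disco_cosine(
    documents, counts, p_over_eps=400, rng=random.Random(0)
)
```

In this toy setting each co-occurrence is emitted with probability 0.2, so roughly a fifth of the naive emissions are shuffled while the estimate stays close to the true value of 0.5.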

Since the CosineSampleEmit algorithm is only guaranteed to produce the correct
similarity score in expectation, we must show that the expected value is highly likely to be
obtained. This guarantee is given in Theorem 1.

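Theorem 1 For any two words x and y having $\cos(x, y) \ge \epsilon$, let $X_1, X_2, \ldots, X_{\#(x,y)}$
represent indicators for the coin flip in calls to CosineSampleEmit with x, y parameters, and
let $X = \sum_{i=1}^{\#(x,y)} X_i$. For any $1 > \delta > 0$, we have

$$\Pr\left[\frac{\epsilon}{p} X > (1+\delta)\cos(x,y)\right] \le \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{p}$$

and

$$\Pr\left[\frac{\epsilon}{p} X < (1-\delta)\cos(x,y)\right] < \exp(-p\delta^2/2)$$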



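Proof We use $\frac{\epsilon}{p} X$ as the estimator for $\cos(x, y)$. Note that

$$\mu_{xy} = E[X] = \#(x,y)\,\frac{p}{\epsilon}\,\frac{1}{\sqrt{\#(x)}\sqrt{\#(y)}} = \frac{p}{\epsilon}\cos(x,y) \ge p$$

Thus by the multiplicative form of the Chernoff bound,

$$\Pr\left[\frac{\epsilon}{p} X > (1+\delta)\cos(x,y)\right] = \Pr\left[\frac{\epsilon}{p} X > (1+\delta)\,\frac{\epsilon}{p}\,\frac{p}{\epsilon}\cos(x,y)\right] = \Pr\left[X > (1+\delta)\mu_{xy}\right] \le \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\mu_{xy}} \le \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{p}$$

Similarly, by the other side of the multiplicative Chernoff bound, we have

$$\Pr\left[\frac{\epsilon}{p} X < (1-\delta)\cos(x,y)\right] = \Pr\left[X < (1-\delta)\mu_{xy}\right] < \exp(-\mu_{xy}\delta^2/2) \le \exp(-p\delta^2/2)$$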



Since there are $\binom{D}{2}$ pairs of words in the dictionary, set $p = \log(D^2) = 2\log(D)$ and
use union bound with Theorem 1 to ensure the above bounds hold simultaneously for all
pairs x, y having $\cos(x, y) \ge \epsilon$. It is worth mentioning that $\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}$ and $\exp(-\delta^2/2)$
are always less than 1, thus raising them to the power of p brings them down exponentially
in p. Note that although we decouple p and $\epsilon$ for the analysis of shuffle size, in a good
implementation they are tightly coupled and the expression $p/\epsilon$ can be thought of as a
single parameter to trade off accuracy and shuffle size. In a large scale experiment described
in Section 9.1 we set $p/\epsilon$ in a coupled manner.
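
The exponential decay in p can be checked numerically. The sketch below simply evaluates the two tail bounds from Theorem 1 for given $\delta$ and p; the function names and the sample values of $\delta$ and p are illustrative assumptions.

```python
import math

def upper_tail_bound(delta, p):
    """(e^delta / (1 + delta)^(1 + delta))^p, the bound on overestimation."""
    return (math.exp(delta) / (1 + delta) ** (1 + delta)) ** p

def lower_tail_bound(delta, p):
    """exp(-p * delta^2 / 2), the bound on underestimation."""
    return math.exp(-p * delta ** 2 / 2)

# Hypothetical settings: relative error delta = 0.5, increasing p.
delta = 0.5
upper = [upper_tail_bound(delta, p) for p in (1, 10, 100)]
lower = [lower_tail_bound(delta, p) for p in (1, 10, 100)]
```

Both lists decrease geometrically as p grows, since each bound is a fixed base below 1 raised to the power p.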

Now we show that the shuffle size is independent of N, which is a great improvement
over the naive approach when N is large. To see the usefulness of these bounds, it is worth
noting that we prove they are almost optimal (i.e. no other algorithm can do much better).
In Theorem 2 we prove that any algorithm that purports to accurately calculate highly
similar pairs must at least output them, and sometimes there are at least DL such pairs,
and so any algorithm that is accurate on highly similar pairs must have at least DL shuffle
size. We are off optimal here by only a $\log(D)/\epsilon$ factor.

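Theorem 2 The expected shuffle size for CosineSampleEmit is $O(DL\log(D)/\epsilon)$ and $\Omega(DL)$.

Proof The expected contribution from each pair of words will constitute the shuffle size:

$$
\begin{aligned}
&\sum_{i=1}^{D}\sum_{j=i+1}^{D}\sum_{k=1}^{\#(w_i,w_j)} \Pr[\mathrm{CosineSampleEmit}(w_i, w_j)] \\
&= \sum_{i=1}^{D}\sum_{j=i+1}^{D} \#(w_i,w_j)\,\Pr[\mathrm{CosineSampleEmit}(w_i, w_j)] \\
&= \sum_{i=1}^{D}\sum_{j=i+1}^{D} \frac{p}{\epsilon}\,\frac{\#(w_i,w_j)}{\sqrt{\#(w_i)}\sqrt{\#(w_j)}} \\
&\le \frac{p}{2\epsilon}\sum_{i=1}^{D}\sum_{j=i+1}^{D} \#(w_i,w_j)\left(\frac{1}{\#(w_i)} + \frac{1}{\#(w_j)}\right) \\
&\le \frac{p}{\epsilon}\sum_{i=1}^{D}\frac{1}{\#(w_i)}\sum_{j=1}^{D} \#(w_i,w_j) \\
&\le \frac{p}{\epsilon}\sum_{i=1}^{D}\frac{1}{\#(w_i)}\,L\,\#(w_i) = \frac{p}{\epsilon}LD = O(DL\log(D)/\epsilon)
\end{aligned}
$$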

The first inequality holds because of the Arithmetic-Mean-Geometric-Mean inequality
applied to $\{1/\#(w_i), 1/\#(w_j)\}$. The last inequality holds because $w_i$ can co-occur with at
most $\#(w_i)L$ other words. It is easy to see via Chernoff bounds that the above shuffle size
is obtained with high probability.

To see the lower bound, we construct a dataset consisting of D/L distinct documents of
length L; furthermore, each document is duplicated L times. To construct this dataset,
consider grouping the dictionary into D/L groups, each group containing L words. A document
is associated with every group, consisting of all the words in the group. This document is
then repeated L times. In each group, it is trivial to check that all pairs of words have
similarity exactly 1. There are $\binom{L}{2}$ pairs for each group and there are D/L groups, making
for a total of $(D/L)\binom{L}{2} = \Omega(DL)$ pairs with similarity 1, and thus also at least $\epsilon$. Since any
algorithm that purports to accurately calculate highly-similar pairs must at least output
them, and there are $\Omega(DL)$ such pairs, we have the lower bound. ■
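
As a rough arithmetic check of what Theorem 2 buys, the sketch below compares the naive emission count with the $(p/\epsilon)LD$ upper bound for the example parameters from the introduction. The choice $\epsilon = 0.5$, the use of the natural logarithm for $p = 2\log(D)$, and the function names are assumptions made for illustration.

```python
import math

def naive_shuffle_size(N, L):
    """N * C(L, 2): every pair in every full-length document is emitted."""
    return N * (L * (L - 1) // 2)

def disco_shuffle_bound(D, L, eps):
    """(p / eps) * L * D with p = 2 * ln(D), the upper bound from Theorem 2."""
    p = 2 * math.log(D)
    return p / eps * L * D

# Parameters from the introduction; eps = 0.5 is an illustrative choice.
N, D, L, eps = 10**9, 10**7, 1000, 0.5
ratio = naive_shuffle_size(N, L) / disco_shuffle_bound(D, L, eps)
```

The DISCO bound has no N argument at all, so the ratio keeps growing as more dimensions are added.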

It is important to observe what happens if the output 'probability' is greater than 1.
We certainly Emit, but when the output probability is greater than 1, care must be taken
during reducing to scale by the correct factor, since it won't be correct to divide by $p/\epsilon$,
which is the usual case when the output probability is less than 1. Instead, we must divide
by $\sqrt{\#(w_1)}\sqrt{\#(w_2)}$ because for the pairs where the output probability is greater than 1,
CosineSampleEmit and Emit are the same. Similar corrections have to be made for the
other similarity scores (Dice and Overlap, but not MinHash), so we do not repeat this
point. Nonetheless it is an important one which arises during implementation.
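
One way to handle this case in a reducer is sketched here, assuming the reducer has access to the same counts and oversampling parameter as the mapper. The function names are hypothetical and the branch structure is an illustration, not the production code.

```python
import math

def emit_probability(n1, n2, p_over_eps):
    """Raw emit 'probability' for a pair; may exceed 1 for rare words."""
    return p_over_eps / (math.sqrt(n1) * math.sqrt(n2))

def reduce_cosine(total, n1, n2, p_over_eps):
    """Turn the summed emissions for a pair into a cosine estimate.

    total is the sum of values received for the pair; n1 and n2 are
    #(w1) and #(w2).
    """
    if emit_probability(n1, n2, p_over_eps) >= 1.0:
        # Every co-occurrence was emitted, so total is the exact dot product.
        return total / (math.sqrt(n1) * math.sqrt(n2))
    # Sampled case: rescale by eps / p.
    return total / p_over_eps
```

For example, with counts 4 and 9 and three co-occurrences all emitted, the first branch returns the exact cosine of 0.5, whereas dividing by $p/\epsilon$ would underestimate it.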

Finally it is easy to see that the largest reduce key will have at most $p/\epsilon = O(\log(D)/\epsilon)$
values.

Theorem 3 The expected number of values mapped to a single key by DISCOMapper is
$p/\epsilon$.

Proof Note that the output of DISCOReducer is a number between 0 and 1. Since this is
obtained by normalizing the sum of all values reduced to the key by $p/\epsilon$, and all summands
are at most 1, we trivially get that the number of summands is at most $p/\epsilon$. ■



5.2 Jaccard Similarity 

Traditionally MinHash (Broder, 1997) is used to compute Jaccard similarity scores between
all pairs in a dictionary. We improve the MinHash scheme to run much more efficiently
with a smaller shuffle size. 

Let h(t) be a hash function that maps documents to distinct numbers in [0, 1], and for
any word w define g(w) (called the MinHash of w) to be the minimum value of h(t) over all
t that contain w. Then $g(w_1) = g(w_2)$ exactly when the minimum hash value of the union
with size $\#(w_1) + \#(w_2) - \#(w_1, w_2)$ lies in the intersection with size $\#(w_1, w_2)$. Thus

$$\Pr[g(w_1) = g(w_2)] = \frac{\#(w_1, w_2)}{\#(w_1) + \#(w_2) - \#(w_1, w_2)} = \mathrm{Jac}(w_1, w_2)$$

Therefore the indicator random variable that is 1 when $g(w_1) = g(w_2)$ has expectation
equal to the Jaccard similarity between the two words. Unfortunately it has too high a
variance to be useful on its own. The idea of the MinHash scheme is to reduce the variance
by averaging together k of these variables constructed in the same way with k different hash
functions. We index these k functions using the notation $g_j(w)$ to denote the MinHash of
hash function $h_j(t)$. We denote the computation of hashes as 'MinHashMap'. Specifically,
MinHashMap is defined as Algorithm 6.

To estimate $\mathrm{Jac}(w_1, w_2)$ using this version of the scheme, we simply count the number of
hash functions for which $g(w_1) = g(w_2)$, and divide by k to get an estimate of $\mathrm{Jac}(w_1, w_2)$.
By the multiplicative Chernoff bound for sums of 0-1 random variables as seen in Theorem
1, setting $k = O(1/\epsilon)$ will ensure that w.h.p. a similarity score that is above $\epsilon$ has relative
error no more than $\delta$. Qualitatively, this theorem is the same as given in (Broder, 1997)
(where MinHash is introduced) and we do not claim the following as a new contribution;
however, we include it for completeness. More rigorously,

Theorem 4 Fix any two words x and y having $\mathrm{Jac}(x, y) \ge \epsilon$. Let $X_1, X_2, \ldots, X_k$ represent
indicators for $\{g_1(x) = g_1(y), \ldots, g_k(x) = g_k(y)\}$ and $X = \sum_{i=1}^{k} X_i$. For any $1 > \delta > 0$
and $k = c/\epsilon$, we have

$$\Pr[X/k > (1+\delta)\mathrm{Jac}(x,y)] \le \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{c}$$

and

$$\Pr[X/k < (1-\delta)\mathrm{Jac}(x,y)] \le \exp(-c\delta^2/2)$$

Proof We use X/k as the estimator for $\mathrm{Jac}(x, y)$. Note that $E[X] = k\,\mathrm{Jac}(x,y) =
(c/\epsilon)\mathrm{Jac}(x, y) \ge c$. Now by standard Chernoff bounds we have

$$\Pr[X/k > (1+\delta)\mathrm{Jac}(x,y)] = \Pr[X > (1+\delta)E[X]] \le \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{E[X]} \le \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{c}$$

Similarly, by the other side of the multiplicative Chernoff bound, we have

$$\Pr[X/k < (1-\delta)\mathrm{Jac}(x,y)] \le \exp(-c\delta^2/2)$$



The MapReduce implementation of the above scheme takes in documents and for each 
unique word in the document outputs k hash values. 

Algorithm 6 MinHashMap(t, $\langle w_1, \ldots, w_L \rangle$)
  for i = 1 to L do
    for j = 1 to k do
      emit $((w_i, j) \to h_j(t))$
    end for
  end for



1. Given document t, Map using MinHashMap (Algorithm 6)

2. Reduce using the min reducer 

3. Now that minhash values for all hash functions are available, for each pair we can 
simply compute the number of hash collisions and divide by the total number of hash 
functions 
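
The three steps above can be sketched as a single-process simulation. This is an illustration under stated assumptions: the hash family `h` is a hypothetical stand-in built from SHA-256 (not the hash functions used by the authors), the map and min-reduce steps are fused into one in-memory loop, and the toy corpus is invented.

```python
import hashlib
from collections import defaultdict

def h(j, doc_id):
    """Hypothetical j-th hash function: maps a document id into [0, 1)."""
    digest = hashlib.sha256(f"{j}:{doc_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def minhash_signatures(documents, k):
    """For each word w and each j, keep the minimum of h_j(t) over documents t containing w."""
    sig = defaultdict(lambda: [1.0] * k)
    for doc_id, words in documents.items():
        for w in set(words):
            for j in range(k):
                # The map step emits ((w, j) -> h_j(t)); the min reducer keeps the smallest.
                sig[w][j] = min(sig[w][j], h(j, doc_id))
    return sig

def estimate_jaccard(sig, w1, w2):
    """Fraction of hash functions on which the two MinHash values collide."""
    k = len(sig[w1])
    return sum(a == b for a, b in zip(sig[w1], sig[w2])) / k

# Toy corpus: "x" and "z" occur in documents 0..99, "y" in documents 50..149,
# so Jac(x, y) = 50 / 150 and Jac(x, z) = 1.
documents = {
    t: (["x", "z"] if t < 100 else []) + (["y"] if t >= 50 else [])
    for t in range(150)
}
sig = minhash_signatures(documents, k=200)
```

Calling `estimate_jaccard(sig, "x", "y")` should land near 1/3, with accuracy improving as k grows; every document contributes k emissions per word, which is the cost the sampled variant reduces.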



Algorithm 7 MinHashSampleMap(t, $\langle w_1, \ldots, w_L \rangle$)
  for i = 1 to L do
    for j = 1 to k do
      if $h_j(t) \le \frac{c\log(Dk)}{\#(w_i)}$ then
        emit $((w_i, j) \to h_j(t))$
      end if
    end for
  end for






BOSAGH ZADEH AND GOEL 



Recall that a document has at most L words. This naive Mapper will have shuffle
size $NLk = O(NL/\epsilon)$, which can be improved upon. After the map phase, for each of
the k hash functions, the standard MinReducer is used, which will compute $g_j(w)$. These
MinHash values are then simply checked for equality. We modify the initial map phase,
and prove that the modification brings down shuffle size while maintaining correctness. The
modification is seen in Algorithm 7; note that c is a small constant we take to be 3.

We now prove that MinHashSampleMap will with high probability Emit the minimum 
hash value for a given word w and hash function h, thus ensuring the steps following 
MinHashSampleMap will be unaffected. 

Theorem 5 If the hash functions $h_1, \ldots, h_k$ map documents to [0, 1] uniform randomly,
and c = 3, then with probability at least $1 - \frac{1}{(Dk)^2}$, for all words w and hash functions
$h \in \{h_1, \ldots, h_k\}$, MinHashSampleMap will emit the document that realizes g(w).

Proof Fix a word w and hash function h and let $z = \min_{t \mid w \in t} h(t)$. Now the probability
that MinHashSampleMap will not emit the document that realizes g(w) is

$$\Pr\left[z > \frac{c\log(Dk)}{\#(w)}\right] = \left(1 - \frac{c\log(Dk)}{\#(w)}\right)^{\#(w)} \le e^{-c\log(Dk)} = \frac{1}{(Dk)^{c}}$$



Thus for a single w and h we have shown MinHashSampleMap will w.h.p. emit the hash
that realizes the MinHash. To show the same result for all hash functions and words in the
dictionary, set c = 3 and use union bound to get a $\left(\frac{1}{Dk}\right)^2$ bound on the probability of error
for any $w_1, \ldots, w_D$ and $h_1, \ldots, h_k$. ■

Now that we have correctness via Theorems 4 and 5, we move on to calculating the shuffle
size for MinHashSampleMap.

Theorem 6 MinHashSampleMap has expected shuffle size $O(Dk\log(Dk)) = O((D/\epsilon)\log(D/\epsilon))$.

Proof Simply adding up the expectations for the emissions indicator variables, we see the
shuffle size is bounded by:

$$\sum_{h \in \{h_1, \ldots, h_k\}} \sum_{w \in \{w_1, \ldots, w_D\}} \frac{c\log(Dk)}{\#(w)}\,\#(w) = Dkc\log(Dk)$$

Setting c = 3 and $k = 1/\epsilon$ gives the desired bound. ■
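
The sampled mapper and its effect on shuffle size can be simulated as follows. This is a sketch under assumptions: hash values are drawn from a seeded pseudo-random generator instead of real hash functions, and the dictionary size D, the number of hash functions k and the single word "w" are hypothetical toy values.

```python
import math
import random

def minhash_sample_map(doc_hashes, words, counts, D, c=3):
    """Emit ((w, j), h_j(t)) only when h_j(t) <= c * log(D * k) / #(w).

    doc_hashes[j] is h_j(t) for the current document t; counts[w] is #(w).
    """
    k = len(doc_hashes)
    numerator = c * math.log(D * k)
    for w in words:
        for j, value in enumerate(doc_hashes):
            if value <= numerator / counts[w]:
                yield (w, j), value

# Toy run: one word occurring in 5000 documents, 10 hash functions,
# and an assumed dictionary size of 100.
rng = random.Random(0)
D, k, n_docs = 100, 10, 5000
counts = {"w": n_docs}
true_min = [1.0] * k
sampled_min = [1.0] * k
shuffle = 0
for _ in range(n_docs):
    hashes = [rng.random() for _ in range(k)]
    true_min = [min(a, b) for a, b in zip(true_min, hashes)]
    for (word, j), value in minhash_sample_map(hashes, ["w"], counts, D):
        sampled_min[j] = min(sampled_min[j], value)   # min reducer
        shuffle += 1
```

In this toy run the unsampled mapper would emit 50,000 values for the word, while the sampled mapper emits only those below the threshold and still recovers every minimum with high probability.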



All of the reducers used in our algorithms are associative and commutative operations
(sum and min), and therefore can be combined for optimization. Our results do not change
qualitatively when combiners are used, except for one case. A subtle point arises in our
claim for improving MinHash. If we combine with m mappers, then the naive MinHash
implementation will have a shuffle size of $O(mDk)$, whereas DISCO provides a shuffle size
of $O(Dk\log(Dk))$. Since m is usually set to be a very large constant (one can easily use
10,000 mappers in a standard Hadoop implementation), removing the dependence on m is
beneficial. For the other similarity measures, combining the naive mappers can only bring
down the shuffle size to $O(mD^2)$, which DISCO improves upon asymptotically by obtaining
a bound of $O(DL\log(D)/\epsilon)$ without even combining, so combining will help even more. In
practice, DISCO can be easily combined, ensuring superior performance both theoretically
and empirically.
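To make the role of combiners concrete, the following sketch simulates a job with and without a per-mapper combiner. It is a toy model, not Hadoop code: `run`, `combine_min`, and `combine_sum` are hypothetical names, and each inner list stands in for the output of one mapper.

```python
def combine_min(emissions):
    """Combiner/reducer for MinHash: keep the minimum value per key."""
    out = {}
    for key, value in emissions:
        out[key] = min(value, out.get(key, value))
    return out

def combine_sum(emissions):
    """Combiner/reducer for the counting similarities: sum the values per key."""
    out = {}
    for key, value in emissions:
        out[key] = out.get(key, 0) + value
    return out

def run(partitions, op, use_combiner):
    """Return (result, shuffle size) for a job over per-mapper outputs."""
    if use_combiner:
        # Each mapper pre-aggregates its own output before the shuffle.
        shuffled = [pair for part in partitions for pair in op(part).items()]
    else:
        shuffled = [pair for part in partitions for pair in part]
    return op(shuffled), len(shuffled)

min_parts = [[("a", 0.7), ("a", 0.2), ("b", 0.9)], [("a", 0.4), ("b", 0.1), ("b", 0.5)]]
sum_parts = [[("x", 1), ("x", 1), ("y", 1)], [("x", 1)]]

print(run(min_parts, combine_min, use_combiner=False))
print(run(min_parts, combine_min, use_combiner=True))
print(run(sum_parts, combine_sum, use_combiner=True))
```

Because min and sum are associative and commutative, the result is the same either way, and only the number of shuffled pairs changes. With a combiner each mapper sends at most one pair per key, which is where the factor m in the combined naive bounds comes from.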

6. Cosine Similarity in Streaming Model 

We briefly depart from the MapReduce framework and instead work in the 'Streaming'
framework. In this setting, data is streamed dimension-by-dimension through a single
machine that can at any time answer queries of the form "what is the similarity between points
x and y considering all the input so far?". The main performance measure is how much
memory the machine uses; queries will be answered in constant time. We describe the
algorithm only for cosine similarity; an almost identical algorithm works for Dice
and overlap similarity.

Our algorithm will be very similar to the MapReduce setup, but in place of emitting
pairs to be shuffled for a reduce phase, we instead insert them into a hash map H, keyed by
pairs of words, with each entry holding a bag of emissions. H is used to track the emissions
by storing them in a bag associated with the pair. Since all data streams through a single
machine, for any word x, we can keep a counter for #(x). This will take D counters' worth
of memory, but as we will see in Theorem 7 this memory usage will be dominated by the size
of H. Each emission is decorated with the probability of emission. There are two operations
to be described: the update that occurs when a new dimension (document) arrives, and the
constant time algorithm used to answer similarity queries.

Update. On an update we are given a document. For each pair of words x, y in the
document, with independent coin flips of probability $q = \frac{p}{\epsilon} \frac{1}{\sqrt{\#(x)\#(y)}}$, we look up the bag
associated with (x, y) in H and insert q into it. It is important to note that the emission is
being done with probability q and q is computed using the current values of #(x) and #(y).
Thus if a query comes after this update, we must take into account all new information.
This is done via subsampling and is explained shortly. It remains to show how to use these
probabilities to answer queries.

Query. We now describe how to answer the only query. Let the query be for the
similarity between words x and y. Recall that at this point we know both #(x) and #(y)
exactly, for the data seen so far. We look up the pair (x, y) in H, and grab the associated
bag of emissions. Recall from above that each emission is decorated with its probability of
emission, $q_i$ for the i'th entry in the bag. Unfortunately $q_i$ will be larger than we need it
to be, since it was computed at a previous time, when fewer occurrences of x and y had
happened. To remedy this, we independently subsample each of the emissions for the pair
x, y with coin flips of probability $\frac{p}{\epsilon} \frac{1}{q_i \sqrt{\#(x)\#(y)}}$. For each pair x, y seen in the input, there
will be exactly

$$q_i \cdot \frac{p}{\epsilon} \frac{1}{q_i \sqrt{\#(x)\#(y)}} = \frac{p}{\epsilon} \frac{1}{\sqrt{\#(x)\#(y)}}$$

probability of surviving emission and the subsampling. Finally, since the pair x, y is seen
exactly #(x, y) times, the same estimator used in Theorem 2 will have expectation equal to
cos(x, y). Furthermore, since before subsampling we output in expectation more pairs than
CosineSampleEmit, the Chernoff bound of Theorem 2 still holds. Finally, to show that H
cannot grow too large, we bound its size in Theorem 7.
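The update and query operations described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `StreamingCosine` and the parameter `rho` (standing for p/ε) are hypothetical, probabilities are capped at 1, and the estimator divides the number of surviving emissions by the target survival probability times $\sqrt{\#(x)\#(y)}$, chosen so that its expectation equals the cosine similarity.

```python
import math
import random
from collections import defaultdict
from itertools import combinations

class StreamingCosine:
    def __init__(self, rho, seed=0):
        self.rho = rho                   # oversampling parameter p/eps (assumed name)
        self.count = defaultdict(int)    # #(x) for every word seen so far
        self.bags = defaultdict(list)    # H: pair -> bag of emission probabilities
        self.rng = random.Random(seed)

    def _prob(self, x, y):
        """Target probability (p/eps) / sqrt(#(x) #(y)), capped at 1."""
        return min(1.0, self.rho / math.sqrt(self.count[x] * self.count[y]))

    def update(self, document):
        """Process one new dimension (document), given as an iterable of words."""
        words = sorted(set(document))
        for w in words:
            self.count[w] += 1
        for x, y in combinations(words, 2):
            q = self._prob(x, y)         # uses the current counters
            if self.rng.random() < q:
                self.bags[(x, y)].append(q)

    def query(self, x, y):
        """Estimate cos(x, y) over all input seen so far."""
        key = (x, y) if x < y else (y, x)
        bag = self.bags.get(key)
        if not bag:
            return 0.0
        target = self._prob(x, y)
        # Subsample each stored emission down to the current target probability.
        survivors = sum(1 for q in bag if self.rng.random() < target / q)
        return survivors / (target * math.sqrt(self.count[x] * self.count[y]))

sketch = StreamingCosine(rho=2.0)
for doc in [["a", "b"], ["a", "b", "c"], ["a"], ["b", "c"]]:
    sketch.update(doc)
print(sketch.query("a", "b"))
```

Since counters only grow, each stored $q_i$ is at least the current target probability, so the subsampling ratio `target / q` never exceeds 1. When `rho` is large enough that every probability is capped at 1, nothing is dropped and the query returns the exact cosine similarity.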

Theorem 7 The streaming algorithm uses at most $O(DL\lg(N)\log(D)/\epsilon)$ memory.

Proof We only need to bound the size of H. Consider a word x and all of its occurrences
in documents $t_1, \ldots, t_{\#(x)}$ at final time (i.e. after all N documents have been processed).
We conceptually, and only for this analysis, construct a new larger dataset C' where each
word x is removed and its occurrences are replaced in order with $\lfloor \lg \#(x) \rfloor + 1$ new words
$x_1, \ldots, x_{\lfloor \lg \#(x) \rfloor + 1}$, so we are effectively segmenting (in time) the occurrences of x. With
this in mind, we construct C' so that each $x_i$ will replace $2^{i-1}$ occurrences of x, in time
order, i.e. we will have $\#(x_i) = 2^{i-1}$.

Note that our streaming algorithm updates the counters for words with every update.
Consider what happens if instead of updating every time, the streaming algorithm somehow
in advance knew and used the final counter values after all documents have been processed.
We call this the 'all-knowing' version. The size of H for such an all-knowing algorithm is
the same as the shuffle size for the DISCO sampling scheme analyzed in Theorem 2, simply
because there is a bijection between the emits of CosineSampleEmit and inserts into H with
exactly the same coin flip probability. We now use this observation and C'.

We show that the memory used by H when our algorithm is run on C' with the all-knowing
counters dominates (in expectation) the size of H for the original dataset in the
streaming model, thus achieving the claimed bound in the current theorem statement.

Let PrHashMapInsert(x, y) denote the probability of inserting the pair x, y when we
run the all-knowing version of the streaming algorithm on C'. Let PrStreamEmit(x, y, a, b)
denote the probability of emitting the pair x, y in the streaming model with input C, after
observing x, y exactly a, b times, respectively. With these definitions we have

$$\begin{aligned}
\text{PrStreamEmit}(x, y, a, b) &= \frac{p}{\epsilon} \frac{1}{\sqrt{ab}} \\
&\le \frac{p}{\epsilon} \frac{1}{\sqrt{2^{\lfloor \lg a \rfloor}} \sqrt{2^{\lfloor \lg b \rfloor}}} \\
&\le \frac{p}{\epsilon} \frac{1}{\sqrt{\#(x_{\lfloor \lg a \rfloor + 1})} \sqrt{\#(y_{\lfloor \lg b \rfloor + 1})}} \\
&= \text{PrHashMapInsert}(x_{\lfloor \lg a \rfloor + 1}, y_{\lfloor \lg b \rfloor + 1})
\end{aligned}$$

The first inequality holds by properties of the floor function. The second inequality holds
by definition of C'. The dictionary size for C' is $O(D\lg(N))$ where D is the original
dictionary size for C. Using the same analysis as Theorem 2, the shuffle size for C' is at most
$O(DL\lg(N)\log(D)/\epsilon)$ and therefore so is the size of H for the all-knowing algorithm run
on C', and by the analysis above, so is the hash map size for the original dataset C. ■
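The segmentation argument in the proof of Theorem 7 can be sanity-checked with a few lines of Python. The helper names below are hypothetical, `rho` stands for p/ε, and probabilities are left uncapped, as they are in the displayed chain of inequalities.

```python
import math

def segment_index(a):
    """Index i of the replacement word x_i that receives the a-th occurrence of x.

    Segment x_i holds 2**(i-1) occurrences, so occurrence a (1-based) falls in
    segment floor(lg a) + 1, which equals a.bit_length() for positive integers.
    """
    return a.bit_length()

def stream_emit_prob(rho, a, b):
    """(p/eps) / sqrt(a*b): emit probability after seeing x, y exactly a, b times."""
    return rho / math.sqrt(a * b)

def all_knowing_prob(rho, a, b):
    """Insert probability for the matching pair of segment words, using final counts."""
    count_x = 2 ** (segment_index(a) - 1)   # #(x_i) = 2^(i-1)
    count_y = 2 ** (segment_index(b) - 1)
    return rho / math.sqrt(count_x * count_y)

# Occurrences 1..8 of a word land in segments 1, 2, 2, 3, 3, 3, 3, 4.
print([segment_index(a) for a in range(1, 9)])
print(stream_emit_prob(1.0, 5, 6), all_knowing_prob(1.0, 5, 6))
```

For every pair of counts the streaming probability is no larger than the all-knowing probability on the segmented dataset, because $2^{\lfloor \lg a \rfloor} \le a$. This is the domination used to transfer the shuffle size bound to the hash map H.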



7. Correctness and Shuffle Size Proofs for other Similarity Measures 
7.1 Overlap Similarity 

Overlap similarity follows the same pattern as we used for cosine similarity, thus we only
explain the parts that are different. The emit function changes to Algorithm 8.

Algorithm 8 OverlapSampleEmit($w_1$, $w_2$) - p/ε is the oversampling parameter
With probability

$$\frac{p}{\epsilon} \frac{1}{\min(\#(w_1), \#(w_2))}$$

emit (($w_1$, $w_2$) → 1)



The correctness proof is nearly identical to cosine similarity so we do not restate it. The
shuffle size for OverlapSampleEmit is given by the following theorem.

Theorem 8 The expected shuffle size for OverlapSampleEmit is $O(DL\log(D)/\epsilon)$.

Proof The expected contribution from each pair of words will constitute the shuffle size:

$$\sum_{i=1}^{D} \sum_{j=i+1}^{D} \frac{p}{\epsilon} \frac{\#(w_i, w_j)}{\min(\#(w_i), \#(w_j))}
\le \frac{2p}{\epsilon} \sum_{i=1}^{D} \frac{1}{\#(w_i)} \sum_{j=1}^{D} \#(w_i, w_j)
\le \frac{2p}{\epsilon} \sum_{i=1}^{D} \frac{1}{\#(w_i)} L\,\#(w_i) = \frac{2p}{\epsilon} LD = O(DL\log(D)/\epsilon)$$

The first inequality holds trivially. The last inequality holds because $w_i$ can co-occur
with at most $\#(w_i)L$ other words. It is easy to see via Chernoff bounds that the above
shuffle size is obtained with high probability. ■
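The overlap emit rule and its shuffle size bound can be illustrated with a short sketch. It is a hedged example rather than the paper's code: `rho` stands for p/ε, the cap at 1 is an added assumption, and the random corpus is made up for the demonstration.

```python
import random
from itertools import combinations

def overlap_emit_prob(count1, count2, rho):
    """Emit probability (p/eps) / min(#(w1), #(w2)), capped at 1."""
    return min(1.0, rho / min(count1, count2))

def overlap_sample_emit(w1, w2, counts, rho, rng):
    """Return the emission ((w1, w2), 1) with the sampling probability, else None."""
    if rng.random() < overlap_emit_prob(counts[w1], counts[w2], rho):
        return ((w1, w2), 1)
    return None

def expected_overlap_shuffle(docs, rho):
    """Sum of emit probabilities over every co-occurring pair in every document."""
    counts = {}
    for words in docs:
        for w in words:
            counts[w] = counts.get(w, 0) + 1
    total = 0.0
    for words in docs:
        for w1, w2 in combinations(sorted(words), 2):
            total += overlap_emit_prob(counts[w1], counts[w2], rho)
    return total, counts

rng = random.Random(7)
docs = [set(rng.sample(range(50), 4)) for _ in range(5000)]   # L = 4
rho = 10.0
expected, counts = expected_overlap_shuffle(docs, rho)
naive = sum(len(d) * (len(d) - 1) // 2 for d in docs)         # every pair emitted
bound = 2 * rho * 4 * len(counts)                             # (2p/eps) * L * D
print(naive, round(expected), bound)
```

The expected shuffle size stays below $(2p/\epsilon)LD$ no matter how many documents are streamed in, whereas the naive pair count grows linearly with the number of documents.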



7.2 Dice Similarity 

Dice similarity follows the same pattern as we used for cosine similarity, thus we only explain
the parts that are different. The emit function changes to Algorithm 9.

Algorithm 9 DiceSampleEmit($w_1$, $w_2$) - p/ε is the oversampling parameter
With probability

$$\frac{p}{\epsilon} \frac{2}{\#(w_1) + \#(w_2)}$$

emit (($w_1$, $w_2$) → 1)



The correctness proof is nearly identical to cosine similarity so we do not restate it. The 
shuffle size for DiceSampleEmit is given by the following theorem. 

Theorem 9 The expected shuffle size for DiceSampleEmit is $O(DL\log(D)/\epsilon)$.

Proof The expected contribution from each pair of words will constitute the shuffle size:

$$2 \sum_{i=1}^{D} \sum_{j=i+1}^{D} \frac{p}{\epsilon} \frac{\#(w_i, w_j)}{\#(w_i) + \#(w_j)}
\le \frac{2p}{\epsilon} \sum_{i=1}^{D} \frac{1}{\#(w_i)} \sum_{j=1}^{D} \#(w_i, w_j)
\le \frac{2p}{\epsilon} \sum_{i=1}^{D} \frac{1}{\#(w_i)} L\,\#(w_i) = \frac{2p}{\epsilon} LD = O(DL\log(D)/\epsilon)$$

The first inequality holds trivially. The last inequality holds because $w_i$ can co-occur
with at most $\#(w_i)L$ other words. It is easy to see via Chernoff bounds that the above
shuffle size is obtained with high probability. ■
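For Dice similarity, a small simulation shows how sampled emissions are rescaled into an estimate. The sketch rests on two assumptions that should be read as such: that the emit probability is $(p/\epsilon) \cdot 2/(\#(w_1)+\#(w_2))$, capped at 1, and that the estimate divides the surviving count by that probability and by the known denominator $(\#(w_1)+\#(w_2))/2$. The function names and `rho` (for p/ε) are hypothetical.

```python
import random

def dice_emit_prob(count1, count2, rho):
    """Assumed emit probability (p/eps) * 2 / (#(w1) + #(w2)), capped at 1."""
    return min(1.0, rho * 2.0 / (count1 + count2))

def dice_estimate(cooccurrences, count1, count2, rho, rng):
    """Simulate the emissions for one pair of words and rescale the surviving count."""
    prob = dice_emit_prob(count1, count2, rho)
    emitted = sum(1 for _ in range(cooccurrences) if rng.random() < prob)
    # E[emitted] = cooccurrences * prob, so this has expectation
    # 2 * cooccurrences / (count1 + count2), the Dice similarity.
    return emitted / prob * 2.0 / (count1 + count2)

rng = random.Random(3)
true_dice = 2 * 300 / (1000 + 800)
estimates = [dice_estimate(300, 1000, 800, rho=100.0, rng=rng) for _ in range(200)]
print(true_dice, sum(estimates) / len(estimates))
```

Raising `rho` increases the emit probability, and with it both the shuffle size and the concentration of the estimate around the true value, which is the tradeoff explored in the experiments.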



8. Cross Product 

Consider the case that we have N vectors (and not D) in $\mathbb{R}^N$ and wish to compute the cross
product of all similarities between a subset D of the N vectors with the remaining N - D
vectors. This is different from computing all $\binom{D}{2}$ similarities, which has been the focus of
the previous sections.

It is worth a brief mention that the same bounds and almost the same proof of Theorem
2 hold for this case, with one caveat: the magnitudes of all N points must be available
to the mappers and reducers. We can mitigate this problem by annotating each dimension
with the magnitudes of points beforehand in a separate MapReduce job. However, this
MapReduce job will have a shuffle size dependent on N, so we do not consider it a part of
the DISCO scheme.



9. Experiments 

We use data from the social networking site Twitter. Twitter is currently a very popular 
social network platform. Users interact with Twitter through a web interface, instant 
messaging clients, or sending mobile text messages. Public updates by users are viewable to 
the world, and a large majority of Twitter accounts are public. These public tweets provide 
a large real-time corpus of what is happening in the world. 

We have two sets of experiments. The first is a large scale experiment that is in
production, to compute similar users in the Twitter follow graph. The second is an experiment to
find similar words. The point of the first experiment is to show the scalability of our algorithms
and the point of the second set of experiments is to show accuracy.

9.1 Similar Users 

Consider the problem of finding all pairs of similarities between a subset of D = 10 twitter 
users. We would like to find all $\binom{D}{2}$ similarities between these users. The number of
dimensions of each user is larger than D, and is denoted $N = 5 \times 10^8$. We used p/ε = 100.
We have a classifier trained to pick the edges with most interaction, and thus can limit the
sparsity of the dataset with L = 1000.

We compute these user similarities daily in a production environment using Hadoop
and Pig implementations at Twitter (Olston et al., 2008), with the results supplementing
Twitter.com. Using 100 Hadoop nodes we can compute the above similarities in around
2 hours. In this case, the naive scheme would have shuffle size $NL^2 = 5 \times 10^{14}$, which is
infeasible; however, it becomes possible with DISCO.



9.2 Similar Words 

In the second experiment we used all tweets seen by Twitter.com in one day. The point of
the experiment is to find all pairs of similar words, where words are similar if they appear
in the same tweets. The number of dimensions N = 198,134,530 in our data is equal to the
number of tweets, and each tweet is a document with size at most 140 characters, providing
a small upper bound for L. These documents are ideal for our framework, since our shuffle
size upper bound depends on L, which in this case is very small. We used a dictionary
of 1000 words that advertisers on Twitter are currently targeting to show Promoted Trends,
Trends, and Accounts. We also tried a uniformly random sampled dictionary without a
qualitative change in results.

The reason we only used D = 1000 is that for the purpose of validating our work
(i.e. reporting the small errors incurred by our algorithms), we have to compute the true
cosine similarities, which means computing the true co-occurrence for every pair, which is a
challenging task computationally. This was a bottleneck only in our experiments for this
paper, and does not affect users of our algorithms. We ran experiments with $D = 10^6$, but
cannot report the true error since finding the true cosine similarities is too computationally
intensive. In this regime, however, our theorems guarantee that the results are good with
high probability.



9.3 Shuffle Size vs Accuracy 

We have two parameters to tweak to trade off accuracy and shuffle size: ε and p. However,
since they only occur in our algorithms as the ratio p/ε, we simply use that as the tradeoff
parameter. The reason we separated p and ε was for the theorems to go through nicely, but
in reality we only see a single tuning parameter.

We increase p/ε exponentially on the x axis and record the ratio of DISCO shuffle size
to the naive implementation. In all cases we can achieve a 90% reduction in shuffle size
from the naive implementation without sacrificing much accuracy, as seen in Figures 2, 4,
and 6. The accuracy we report is with respect to true cosine, dice, and overlap similarity.

Figure 1: Average error for all pairs with similarity > ε. DISCO estimated Cosine error
decreases for more similar pairs. Shuffle reduction from naive implementation:
99.39%.

[Plot: DISCO Cosine shuffle size vs accuracy tradeoff]

Figure 2: As p/ε increases, shuffle size increases and error decreases. There is no
thresholding for highly similar pairs here. Ground truth is computed with the naive algorithm.
When the true similarity is zero, DISCO always also returns zero, so we always
get those right. It remains to estimate those pairs with similarity > 0, and that
is the average relative error for those pairs that we report here.



9.4 Error vs Similarity Magnitude 

All of our theorems report better accuracy for pairs that have higher similarity than
otherwise. To see this empirically, we plot the average error of all pairs that have true similarity
above ε. These can be seen in Figures 1, 3, 5, 7, and 8. Note that the reason for large
portions of the error being constant in these plots is that there are very few pairs with very
high similarities, and therefore the error remains constant while ε lies between the similarities
of two such very high similarity pairs. Further note that Figures 7 and 8 are so similar
because our proposed DISCO MinHash very closely mimics the original MinHash. This is
reflected in the strength of the bound in Theorem 5.

[Plot: DISCO Dice Similarity]

Figure 3: Average error for all pairs with similarity > ε. DISCO estimated Dice error
decreases for more similar pairs. Shuffle reduction from naive implementation:
99.76%.

[Plot: DISCO Dice shuffle size vs accuracy tradeoff]

Figure 4: As p/ε increases, shuffle size increases and error decreases. There is no
thresholding for highly similar pairs here.
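The thresholded error curves discussed in Section 9.4 can be computed with a helper along the following lines. This is a sketch under an assumption: it uses the average relative error, the measure named in the caption of Figure 2, and the other plots may define error differently. The function name and the toy similarity values are hypothetical.

```python
def average_error_above(true_sim, est_sim, threshold):
    """Average relative error over pairs whose true similarity exceeds threshold.

    true_sim and est_sim map pairs to similarities; a pair missing from est_sim
    is treated as an estimate of 0. Returns None when no pair qualifies.
    """
    errors = [abs(est_sim.get(pair, 0.0) - s) / s
              for pair, s in true_sim.items() if s > threshold and s > 0]
    return sum(errors) / len(errors) if errors else None

# Toy data: three word pairs with exact and estimated similarities.
true_sim = {("a", "b"): 0.5, ("a", "c"): 0.1, ("b", "c"): 0.8}
est_sim = {("a", "b"): 0.45, ("a", "c"): 0.2, ("b", "c"): 0.8}

for threshold in (0.0, 0.4, 0.9):
    print(threshold, average_error_above(true_sim, est_sim, threshold))
```

Sweeping the threshold reproduces the shape of the curves: once the threshold passes the similarity of a pair, that pair drops out of the average, so the curve is flat between the similarities of consecutive high-similarity pairs.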

[Plot: DISCO Overlap Similarity]

Figure 5: Average error for all pairs with similarity > ε. DISCO estimated Overlap error
decreases for more similar pairs. Shuffle reduction from naive implementation:
97.86%.

[Plot: DISCO Overlap shuffle size vs accuracy tradeoff]

Figure 6: As p/ε increases, shuffle size increases and error decreases. There is no
thresholding for highly similar pairs here.

[Plot: MinHash Similarity; x axis: similarity threshold]

Figure 7: Average error for all pairs with similarity > ε. MinHash Jaccard similarity error
decreases for more similar pairs. We are computing error with MinHash here, not
ground truth.

Figure 8: Average error for all pairs with similarity > ε. DISCO MinHash Jaccard similarity
error decreases for more similar pairs. We are computing error with MinHash
here, not ground truth.

10. Conclusions and Future Directions

We presented the DISCO suite of algorithms to compute all pairwise similarities between
very high dimensional sparse vectors. All of our results are provably independent of
dimension, meaning that apart from the initial cost of trivially reading in the data, all subsequent
operations are independent of the dimension; thus the dimension can be very large.

Although we use the MapReduce (Dean and Ghemawat, 2008) and Streaming
computation models to discuss shuffle size and memory, the sampling strategy we use can be
generalized to other frameworks. We anticipate the DISCO sampling strategy to be useful
whenever one is computing a number between 0 and 1 by taking the ratio of an unknown
number (the dot product in our case) to some known number (e.g. $\sqrt{\#(x)\#(y)}$ for cosine
similarity) (Zadeh and Carlsson, 2013). This is a high-level description, and in order to
make it more practical, we give five concrete examples, along with proofs and experiments.